Introduction

Part 1 Recap

In our Part 1, we analyzed a hotel booking dataset from the INN Hotels Group to understand the factors influencing booking cancellations.

hotel_data <- read.csv("../../Dataset/INNHotelsGroup_min.csv")
hotel_datay <- read.csv("../../Dataset/INNHotelsGroup.csv")
str(hotel_datay)
## 'data.frame':    36275 obs. of  19 variables:
##  $ Booking_ID                          : chr  "INN00001" "INN00002" "INN00003" "INN00004" ...
##  $ no_of_adults                        : int  2 2 1 2 2 2 2 2 3 2 ...
##  $ no_of_children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_weekend_nights                : int  1 2 2 0 1 0 1 1 0 0 ...
##  $ no_of_week_nights                   : int  2 3 1 2 1 2 3 3 4 5 ...
##  $ type_of_meal_plan                   : chr  "Meal Plan 1" "Not Selected" "Meal Plan 1" "Meal Plan 1" ...
##  $ required_car_parking_space          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ room_type_reserved                  : chr  "Room_Type 1" "Room_Type 1" "Room_Type 1" "Room_Type 1" ...
##  $ lead_time                           : int  224 5 1 211 48 346 34 83 121 44 ...
##  $ arrival_year                        : int  2017 2018 2018 2018 2018 2018 2017 2018 2018 2018 ...
##  $ arrival_month                       : int  10 11 2 5 4 9 10 12 7 10 ...
##  $ arrival_date                        : int  2 6 28 20 11 13 15 26 6 18 ...
##  $ market_segment_type                 : chr  "Offline" "Online" "Online" "Online" ...
##  $ repeated_guest                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_price_per_room                  : num  65 106.7 60 100 94.5 ...
##  $ no_of_special_requests              : int  0 1 0 0 0 1 1 1 1 3 ...
##  $ booking_status                      : chr  "Not_Canceled" "Not_Canceled" "Canceled" "Canceled" ...

The dataset contains hotel bookings collected from 2017-2018 and is characterized by a total of 19 columns, comprising 5 categorical and 14 numerical variables.

Exploratory Data Analysis (EDA)

1. Summary Statistics

summary(hotel_data)
##   no_of_adults  no_of_children  no_of_weekend_nights no_of_week_nights
##  Min.   :0.00   Min.   : 0.00   Min.   :0.00         Min.   : 0.0     
##  1st Qu.:2.00   1st Qu.: 0.00   1st Qu.:0.00         1st Qu.: 1.0     
##  Median :2.00   Median : 0.00   Median :1.00         Median : 2.0     
##  Mean   :1.84   Mean   : 0.11   Mean   :0.81         Mean   : 2.2     
##  3rd Qu.:2.00   3rd Qu.: 0.00   3rd Qu.:2.00         3rd Qu.: 3.0     
##  Max.   :4.00   Max.   :10.00   Max.   :7.00         Max.   :17.0     
##  type_of_meal_plan  required_car_parking_space room_type_reserved   lead_time  
##  Length:36275       Min.   :0.000              Length:36275       Min.   :  0  
##  Class :character   1st Qu.:0.000              Class :character   1st Qu.: 17  
##  Mode  :character   Median :0.000              Mode  :character   Median : 57  
##                     Mean   :0.031                                 Mean   : 85  
##                     3rd Qu.:0.000                                 3rd Qu.:126  
##                     Max.   :1.000                                 Max.   :443  
##   arrival_year  arrival_month   market_segment_type repeated_guest 
##  Min.   :2017   Min.   : 1.00   Length:36275        Min.   :0.000  
##  1st Qu.:2018   1st Qu.: 5.00   Class :character    1st Qu.:0.000  
##  Median :2018   Median : 8.00   Mode  :character    Median :0.000  
##  Mean   :2018   Mean   : 7.42                       Mean   :0.026  
##  3rd Qu.:2018   3rd Qu.:10.00                       3rd Qu.:0.000  
##  Max.   :2018   Max.   :12.00                       Max.   :1.000  
##  no_of_previous_cancellations no_of_special_requests booking_status    
##  Min.   : 0.00                Min.   :0.00           Length:36275      
##  1st Qu.: 0.00                1st Qu.:0.00           Class :character  
##  Median : 0.00                Median :0.00           Mode  :character  
##  Mean   : 0.02                Mean   :0.62                             
##  3rd Qu.: 0.00                3rd Qu.:1.00                             
##  Max.   :13.00                Max.   :5.00                             
##  avg_price_per_room
##  Min.   :  0       
##  1st Qu.: 80       
##  Median : 99       
##  Mean   :103       
##  3rd Qu.:120       
##  Max.   :540

Summary:
The average price per room in the dataset is 103 euros, with a median of 99 euros, but prices can reach up to 540 euros, indicating high-priced outliers. Some entries even show an average price of zero, possibly reflecting promotional deals. Guests typically stay for two weekday nights and one weekend night, with the average number of weekday nights being 2.2 and weekend nights 0.81. Stays can extend to as many as 17 weekday nights. Most bookings involve two adults, and many guests do not bring children. Lead times vary significantly, with an average of 85 days, a median of 57, and some bookings made up to 443 days in advance, suggesting a right-skewed distribution. The data also shows sparse previous cancellations, with an average of just 0.02, and a maximum of 58. Bookings are spread over 2017 and 2018, peaking in August. Additionally, most guests do not make special requests, as the median is zero, although some make up to five requests per booking.

2. Distribution of Booking Status

print(ggplot(hotel_data, aes(x = booking_status)) +
  geom_bar(fill = "lightblue") +
  labs(title = "Booking Status Distribution", x = "Booking Status", y = "Count"))

Summary
The dataset contains 36,275 bookings. Of the total bookings, 24,390 bookings were not canceled (67.2%), while 11,885 bookings were canceled (32.8%), reflecting a significant cancellation rate that offers rich insights into factors influencing booking decisions.

3. Barplot: Number of Special Requests

print(ggplot(hotel_data, aes(x = factor(no_of_special_requests))) +
  geom_bar(fill = "lightblue") +
  labs(title = "Number of Special Requests", x = "Number of Special Requests", y = "Count") +
  theme_minimal())

Summary
Most guests make no special requests, with a sharp decline as the number increases. A significant portion makes one request, while two or more requests are increasingly rare.

4. Plot: Market Segment vs Booking Status

contingency_table <- table(hotel_data$market_segment_type, hotel_data$booking_status)

plot_data <- as.data.frame(contingency_table)
colnames(plot_data) <- c("Market_Segment", "Booking_Status", "Count")

print(ggplot(plot_data, aes(x = Market_Segment, y = Count, fill = Booking_Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Booking Status by Market Segment",
       x = "Market Segment",
       y = "Count",
       fill = "Booking Status") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)))

Summary
Online market segment have the highest cancellation count, with a large proportion of canceled bookings of 8475. Offline segment has a cancellation count of 3253 For other market segment, the cancellation rates are too less.

5. Histogram: Lead_time

ggplot(hotel_data, aes(x = lead_time, fill = booking_status)) +
  geom_histogram(binwidth = 10, position = "dodge") +
  labs(
    title = "Lead Time vs Booking Status",
    x = "Lead Time (Days)",
    y = "Number of Bookings",
    fill = "Booking Status"
  ) +
  theme_minimal()

Summary
Here we can see that the booking with short lead times are less cancelled and as the lead time increases there are more booking cancellations. Also, we can see more cancellation happening between lead time form 100-200.

6. Boxplot: avg_price_per_room

ggplot(hotel_data, aes(x = booking_status, y = avg_price_per_room, fill = booking_status)) +
  geom_boxplot() +
  labs(
    title = "Average Room Price vs Booking Status",
    x = "Booking Status",
    y = "Average Room Price"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Summary
The booking with cancelled status has higher median average room price value compared to non-cancelled booking status.This might mean that the booking with her average room price have more chances to get cancelled.

Key Statistical Findings

1. Special Requests and Cancellations:
Bookings with special requests had a significantly lower cancellation rate (20.2%) compared to those without (43.2%).  Chi-square test revealed a statistically significant association between special requests and booking status.

2. Previous Cancellation History:
Surprisingly, guests with previous cancellations had a much lower current cancellation rate (4.73%) compared to those without (33.03%).
This suggests that past cancellation history might not be a straightforward predictor of future booking behavior.

3. Factors Influencing Cancellations:
3.1 Lead Time:

  1. Average lead time for canceled bookings: 139.2 days
  2. Average lead time for non-canceled bookings: 58.9 days
  3. Longer lead times strongly correlated with higher cancellation probability.

3.2 Room Price:

  1. Canceled bookings had a higher average room price.
  2. Price difference: Approximately 10.7 units higher for canceled bookings.

4. Seasonal Variations:
The correlation between factors and cancellations varied across seasons:

4.1 Fall:
Strongest lead time correlation (0.54).

4.2 Summer:
Special requests most negatively correlated with cancellations (-0.30).

4.3 Spring:
Unique pattern with special requests having a strong negative correlation (-0.36).

4.4 Winter:
Slightly different dynamics with weaker correlations.

hotel_data$booking_status_binary <- ifelse(hotel_data$booking_status == "Canceled", 1, 0)

remove_zero_variance <- function(df) {
  df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}

data_filtered <- remove_zero_variance(select_if(hotel_data, is.numeric))

cor_data <- cor(data_filtered, use = "complete.obs")

corrplot(cor_data, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix Recap", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

Lead time strongly predicts cancellations, with longer lead times increasing likelihood (0.44). More special requests (-0.25) and repeated guest status (-0.11) reduce cancellations, reflecting customer commitment and loyalty. Price per room has a weak positive correlation (0.14) with cancellations. Factors like travel party size and stay length show minimal impact on cancellation likelihood.

Preliminary Conclusions:

1. Lead time emerged as the most critical factor in predicting booking cancellations.
2. Special requests significantly reduce the likelihood of cancellations.
3. Booking behavior varies considerably across different seasons.
4. Higher-priced rooms show a slight tendency towards more cancellations.

Part 2 Agenda

In this phase, we aim to build a predictive model for hotel booking cancellations and explore key factors driving cancellation behavior. The analysis will address the following critical questions:

Can we predict the likelihood of booking cancellation based on the lead time?
Can we predict peak booking times and high-demand periods?
What is the relationship between room price and the likelihood of a booking being canceled?
How do seasonal trends impact cancellation rates and booking patterns?
Which factors most strongly influence booking cancellation decisions?

To achieve these objectives, we will analyze customer attributes such as room price, lead time, and seasonal trends. Using logistic and Random forest models, we aim to predict the likelihood of booking cancellations and identify actionable insights. The results will help hotels effectively manage cancellation risks, optimize revenue strategies, and better anticipate high-demand periods.

SMART Questions:

1. Can we predict the likelihood of booking cancellation based on the lead time?

set.seed(123)
train_index <- createDataPartition(hotel_data$booking_status_binary, p = 0.7, list = FALSE)
train_data <- hotel_data[train_index, ]
test_data <- hotel_data[-train_index, ]
logistic_model <- glm(
  booking_status_binary ~ lead_time,
  data = train_data, 
  family = binomial()
)

summary(logistic_model)
## 
## Call:
## glm(formula = booking_status_binary ~ lead_time, family = binomial(), 
##     data = train_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.82038    0.02328   -78.2   <2e-16 ***
## lead_time    0.01180    0.00019    62.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 32083  on 25392  degrees of freedom
## Residual deviance: 27091  on 25391  degrees of freedom
## AIC: 27095
## 
## Number of Fisher Scoring iterations: 4
test_predictions <- predict(logistic_model, newdata = test_data, type = "response")
predicted_classes <- ifelse(test_predictions > 0.5, 1, 0)
confusion_matrix <- table(Actual = test_data$booking_status_binary, 
                          Predicted = predicted_classes)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(confusion_matrix)
##       Predicted
## Actual    0    1
##      0 6720  570
##      1 2102 1490
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
precision <- confusion_matrix[2,2] / sum(confusion_matrix[,2])
recall <- confusion_matrix[2,2] / sum(confusion_matrix[2,])
f1_score <- 2 * (precision * recall) / (precision + recall)

cat("\nModel Performance Metrics:\n")
## 
## Model Performance Metrics:
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.754
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.723
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.415
cat("F1 Score:", round(f1_score, 4), "\n")
## F1 Score: 0.527

1. Accuracy: 0.754, meaning the model correctly classifies 75.4% of the bookings.
2. Precision: 0.723, indicating that 72.3% of the predictions of canceled bookings are correct.
3. Recall: 0.415, meaning the model correctly identifies 41.5% of the actual canceled bookings.
4. F1-Score: 0.527, which is the harmonic mean of precision and recall, providing a balanced evaluation of the model’s performance.

roc_curve <- roc(test_data$booking_status_binary, test_predictions)
plot(roc_curve, main = "ROC Curve for Booking Cancellation Prediction")

The plot indicates that as the specificity (the ability to correctly identify non-canceled bookings) increases, the sensitivity (the ability to correctly identify canceled bookings) also increases. This suggests that the model has good discriminative power in predicting booking cancellations.

auc_value <- auc(roc_curve)
cat("Area Under the ROC Curve (AUC):", round(auc_value, 4), "\n")
## Area Under the ROC Curve (AUC): 0.75

Area Under the ROC Curve (AUC): 0.75, suggesting the model has good discriminative power in predicting booking cancellations.

hotel_data$booking_status_binary <- as.numeric(as.character(hotel_data$booking_status_binary))

ggplot(hotel_data, aes(x = lead_time, y = booking_status_binary)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE) +
  labs(
    title = "Probability of Booking Cancellation vs Lead Time",
    x = "Lead Time (Days)",
    y = "Probability of Cancellation"
  ) +
  theme_minimal()

The graph shows the relationship between the lead time (in days) and the probability of booking cancellation. As the lead time increases, the probability of booking cancellation rises in a non-linear fashion, with the curve becoming more steep at higher lead times.

2. Can we predict high-demand periods?

prepare_booking_data <- function(hotel_data) {
  booking_data <- hotel_data %>%
    mutate(
      high_demand = as.factor(ifelse(
        avg_price_per_room > quantile(avg_price_per_room, 0.75), 
        1, 0
      )),
      booking_month = as.factor(arrival_month),
      is_weekend = no_of_weekend_nights > 0,
      total_nights = no_of_weekend_nights + no_of_week_nights,
      is_repeated_guest = repeated_guest > 0
    )
  
  return(booking_data)
}
prepared_data <- prepare_booking_data(hotel_data)
set.seed(123)
train_index <- createDataPartition(prepared_data$high_demand, p = 0.7, list = FALSE)
train_data <- prepared_data[train_index, ]
test_data <- prepared_data[-train_index, ]
# Random Forest Model
rf_model <- randomForest(
  high_demand ~ lead_time + 
                total_nights + 
                booking_month + 
                market_segment_type + 
                is_repeated_guest,
  data = train_data,
  ntree = 500,
  importance = TRUE
)

evaluate_model <- function(actual, predicted, model_name) {
  conf_matrix <- confusionMatrix(as.factor(predicted), as.factor(actual))
  
  results <- data.frame(
    Model = model_name,
    Accuracy = conf_matrix$overall['Accuracy'],
    Precision = conf_matrix$byClass['Precision'],
    Recall = conf_matrix$byClass['Recall'],
    F1 = conf_matrix$byClass['F1']
  )
  
  return(results)
}

rf_pred <- predict(rf_model, newdata = test_data, type = "prob")[,2]
rf_class <- ifelse(rf_pred > 0.5, 1, 0)
rf_results <- evaluate_model(test_data$high_demand, rf_class, "Random Forest")
print(rf_results)
##                  Model Accuracy Precision Recall    F1
## Accuracy Random Forest     0.81     0.847  0.912 0.878
rf_importance <- data.frame(
  Feature = rownames(importance(rf_model)),
  Importance = importance(rf_model)[,1]
) %>% arrange(desc(Importance))
print(head(rf_importance, 10))
##                                 Feature Importance
## booking_month             booking_month      100.9
## lead_time                     lead_time       92.9
## market_segment_type market_segment_type       85.6
## total_nights               total_nights       50.4
## is_repeated_guest     is_repeated_guest       12.9
ggplot(head(rf_importance, 10), aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "Top 10 Features Predicting High-Demand Periods",
    x = "Features",
    y = "Importance"
  ) +
  theme_minimal()

predict_high_demand_periods <- function(model, test_data) {
  pred_probs <- predict(model, newdata = test_data, type = "prob")[,2]
  high_demand_periods <- test_data[pred_probs > 0.7, ]
  
  print("High-Demand Periods Prediction:")
  print(paste("Total High-Demand Periods:", sum(pred_probs > 0.7)))
  print("Sample of High-Demand Periods:")
  print(head(high_demand_periods))
  
  return(high_demand_periods)
}

high_demand_periods <- predict_high_demand_periods(rf_model, test_data)
## [1] "High-Demand Periods Prediction:"
## [1] "Total High-Demand Periods: 1462"
## [1] "Sample of High-Demand Periods:"
##     no_of_adults no_of_children no_of_weekend_nights no_of_week_nights
## 62             1              0                    0                 3
## 110            2              0                    2                 1
## 181            2              0                    0                 3
## 202            2              0                    0                 2
## 223            2              0                    1                 5
## 232            2              1                    1                 3
##     type_of_meal_plan required_car_parking_space room_type_reserved lead_time
## 62        Meal Plan 1                          0        Room_Type 4        19
## 110       Meal Plan 1                          0        Room_Type 1        32
## 181       Meal Plan 1                          0        Room_Type 1        11
## 202       Meal Plan 1                          0        Room_Type 1        80
## 223       Meal Plan 1                          0        Room_Type 1        56
## 232       Meal Plan 1                          0        Room_Type 1        77
##     arrival_year arrival_month market_segment_type repeated_guest
## 62          2018             5              Online              0
## 110         2017             9              Online              0
## 181         2017             9              Online              0
## 202         2017             7              Online              0
## 223         2018             9              Online              0
## 232         2018             5              Online              0
##     no_of_previous_cancellations no_of_special_requests booking_status
## 62                             0                      2   Not_Canceled
## 110                            0                      3   Not_Canceled
## 181                            0                      1   Not_Canceled
## 202                            0                      1       Canceled
## 223                            0                      0   Not_Canceled
## 232                            0                      1   Not_Canceled
##     avg_price_per_room booking_status_binary high_demand booking_month
## 62               120.1                     0           1             5
## 110               94.5                     0           0             9
## 181               80.8                     0           0             9
## 202               76.5                     1           0             7
## 223              119.0                     0           0             9
## 232              135.2                     0           1             5
##     is_weekend total_nights is_repeated_guest
## 62       FALSE            3             FALSE
## 110       TRUE            3             FALSE
## 181      FALSE            3             FALSE
## 202      FALSE            2             FALSE
## 223       TRUE            6             FALSE
## 232       TRUE            4             FALSE

3. What is the relationship between room price and the likelihood of a booking being canceled or not?

# Fit a logistic regression model
lrmodel1 <- glm(booking_status_binary ~ avg_price_per_room, data = hotel_data, family = binomial)

summary(lrmodel1)
## 
## Call:
## glm(formula = booking_status_binary ~ avg_price_per_room, family = binomial, 
##     data = hotel_data)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.632519   0.036523   -44.7   <2e-16 ***
## avg_price_per_room  0.008695   0.000326    26.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 45887  on 36274  degrees of freedom
## Residual deviance: 45153  on 36273  degrees of freedom
## AIC: 45157
## 
## Number of Fisher Scoring iterations: 4
hotel_data$predicted_prob <- predict(lrmodel1, type = "response")

hotel_data$predicted_class <- ifelse(hotel_data$predicted_prob > 0.5, 1, 0)

# Model evaluation
conf_matrix <- table(hotel_data$booking_status_binary, hotel_data$predicted_class)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.67"
roc_curve <- roc(hotel_data$booking_status_binary, hotel_data$predicted_prob)
plot(roc_curve)

auc_value <- auc(roc_curve)
print(paste("AUC:", round(auc_value, 2)))
## [1] "AUC: 0.6"
# Scatter plot to examine the relationship between price and cancellation
plot(hotel_data$avg_price_per_room, hotel_data$booking_status_binary, 
     main = "Scatter plot of Avg Price per Room vs Booking Status", 
     xlab = "Average Price per Room", ylab = "Booking Status")

# Boxplot for average price per room vs booking cancellation status
boxplot(avg_price_per_room ~ booking_status_binary, data = hotel_data, 
        main = "Boxplot of Avg Price per Room vs Booking Status", 
        xlab = "Booking Status", ylab = "Average Price per Room")

5. What key factors most strongly influence booking cancellations?

hotel_data$type_of_meal_plan <- as.factor(hotel_data$type_of_meal_plan)
hotel_data$room_type_reserved <- as.factor(hotel_data$room_type_reserved)
hotel_data$market_segment_type <- as.factor(hotel_data$market_segment_type)
hotel_data$arrival_month <- as.factor(hotel_data$arrival_month)
rf_model1 <-randomForest(booking_status_binary ~ no_of_adults + no_of_children + no_of_weekend_nights + 
                          no_of_week_nights + type_of_meal_plan + required_car_parking_space + 
                          room_type_reserved + lead_time + arrival_year + arrival_month + 
                          market_segment_type + repeated_guest + 
                          no_of_previous_cancellations + avg_price_per_room + no_of_special_requests,
                        data = hotel_data, importance = TRUE, ntree = 100)

feature_importance <- importance(rf_model1)

importance_df <- data.frame(Feature = rownames(feature_importance), Importance = feature_importance[,1])

importance_df <- importance_df[order(-importance_df$Importance),]
importance_df
##                                                   Feature Importance
## lead_time                                       lead_time     214.99
## no_of_special_requests             no_of_special_requests     183.66
## arrival_month                               arrival_month     109.45
## avg_price_per_room                     avg_price_per_room     104.81
## market_segment_type                   market_segment_type      97.86
## no_of_week_nights                       no_of_week_nights      52.54
## no_of_weekend_nights                 no_of_weekend_nights      48.14
## required_car_parking_space     required_car_parking_space      42.51
## no_of_adults                                 no_of_adults      33.90
## type_of_meal_plan                       type_of_meal_plan      32.13
## room_type_reserved                     room_type_reserved      31.48
## arrival_year                                 arrival_year      25.43
## no_of_children                             no_of_children      19.48
## repeated_guest                             repeated_guest      10.73
## no_of_previous_cancellations no_of_previous_cancellations       4.74
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +  # Flip the axes to make it easier to read
  labs(title = "Feature Importance from Random Forest Model",
       x = "Feature", y = "Importance") +
  theme_minimal()

# Converting booking status to numeric
hotel_data$booking_status_binary <- as.numeric(hotel_data$booking_status_binary)
hotel_data$market_segment_type_num <- as.numeric(hotel_data$market_segment_type)
hotel_data$arrival_month_num <- as.numeric(hotel_data$arrival_month)

corr_data <- hotel_data[, c("booking_status_binary", "lead_time", "no_of_special_requests", 
                            "avg_price_per_room", "market_segment_type_num", "no_of_week_nights", 
                            "no_of_weekend_nights", "arrival_month_num")]

#correlation matrix
cor_matrix <- cor(corr_data)

print(cor_matrix)
##                         booking_status_binary lead_time no_of_special_requests
## booking_status_binary                  1.0000   0.43854                -0.2531
## lead_time                              0.4385   1.00000                -0.1016
## no_of_special_requests                -0.2531  -0.10164                 1.0000
## avg_price_per_room                     0.1426  -0.06260                 0.1844
## market_segment_type_num                0.1360  -0.00693                 0.3085
## no_of_week_nights                      0.0930   0.14965                 0.0460
## no_of_weekend_nights                   0.0616   0.04660                 0.0606
## arrival_month_num                     -0.0112   0.13681                 0.1106
##                         avg_price_per_room market_segment_type_num
## booking_status_binary              0.14257                 0.13601
## lead_time                         -0.06260                -0.00693
## no_of_special_requests             0.18438                 0.30848
## avg_price_per_room                 1.00000                 0.37565
## market_segment_type_num            0.37565                 1.00000
## no_of_week_nights                  0.02275                 0.11295
## no_of_weekend_nights              -0.00452                 0.12907
## arrival_month_num                  0.05442                -0.00631
##                         no_of_week_nights no_of_weekend_nights
## booking_status_binary              0.0930              0.06156
## lead_time                          0.1497              0.04660
## no_of_special_requests             0.0460              0.06059
## avg_price_per_room                 0.0228             -0.00452
## market_segment_type_num            0.1130              0.12907
## no_of_week_nights                  1.0000              0.17958
## no_of_weekend_nights               0.1796              1.00000
## arrival_month_num                  0.0374             -0.00989
##                         arrival_month_num
## booking_status_binary            -0.01123
## lead_time                         0.13681
## no_of_special_requests            0.11055
## avg_price_per_room                0.05442
## market_segment_type_num          -0.00631
## no_of_week_nights                 0.03738
## no_of_weekend_nights             -0.00989
## arrival_month_num                 1.00000
# Plotting the correlation matrix using corrplot
corrplot(cor_matrix, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)

Building booking cancellation predictive model using the key factors

Checking for multicollinearity

# Assuming you've already fitted the logistic regression model
lm_model1 <- lm(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room + 
                 market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month,
                 data = hotel_data)
#summary(lm_model1)

gvif_values <- vif(lm_model1)

print(gvif_values)
##                        GVIF Df GVIF^(1/(2*Df))
## lead_time              1.25  1            1.12
## no_of_special_requests 1.21  1            1.10
## avg_price_per_room     1.44  1            1.20
## market_segment_type    1.70  4            1.07
## no_of_week_nights      1.08  1            1.04
## no_of_weekend_nights   1.06  1            1.03
## arrival_month          1.36 11            1.01

Building model using Random Forest

hotel_data$booking_status_binary <- as.factor(hotel_data$booking_status_binary)

# Fit Random Forest model for classification (binary target)
rf_model2 <- randomForest(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room + 
                          market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month, 
                          data = hotel_data, ntree = 100)

# Print the model summary to ensure it's treated as classification
print(rf_model2)
## 
## Call:
##  randomForest(formula = booking_status_binary ~ lead_time + no_of_special_requests +      avg_price_per_room + market_segment_type + no_of_week_nights +      no_of_weekend_nights + arrival_month, data = hotel_data,      ntree = 100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 10.8%
## Confusion matrix:
##       0    1 class.error
## 0 22868 1522      0.0624
## 1  2386 9499      0.2008
pred_class_rf <- predict(rf_model2, newdata = hotel_data)

# Confusion Matrix (Make sure both predicted and actual values have the same levels)
conf_matrix_rf <- confusionMatrix(pred_class_rf, hotel_data$booking_status_binary)
conf_matrix_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 23496  1592
##          1   894 10293
##                                         
##                Accuracy : 0.931         
##                  95% CI : (0.929, 0.934)
##     No Information Rate : 0.672         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.842         
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.963         
##             Specificity : 0.866         
##          Pos Pred Value : 0.937         
##          Neg Pred Value : 0.920         
##              Prevalence : 0.672         
##          Detection Rate : 0.648         
##    Detection Prevalence : 0.692         
##       Balanced Accuracy : 0.915         
##                                         
##        'Positive' Class : 0             
## 
# Print confusion matrix and the calculated metrics

conf_matrix_df <- as.data.frame(as.table(conf_matrix_rf))
conf_matrix_df
##   Prediction Reference  Freq
## 1          0         0 23496
## 2          1         0   894
## 3          0         1  1592
## 4          1         1 10293
colnames(conf_matrix_df) <- c("Actual", "Predicted", "Freq")  # Rename columns for clarity

# Plot confusion matrix as heatmap using ggplot2
ggplot(conf_matrix_df, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "black", size = 6) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(x = "Predicted", y = "Actual", title = "Confusion Matrix Heatmap") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Extracting Precision, Recall, and Accuracy from the confusion matrix
precision <- conf_matrix_rf$byClass['Pos Pred Value']
recall <- conf_matrix_rf$byClass['Sensitivity']
accuracy <- conf_matrix_rf$overall['Accuracy']

cat("Precision:", round(precision, 4), "\n")
## Precision: 0.936
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.963
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.931
# Predict class probabilities for the binary outcome (probabilities for each class, we need the second column for class 1)
pred_probs_rf <- predict(rf_model2, newdata = hotel_data, type = "prob")[, 2]

# Calculate ROC-AUC for Random Forest (use the second column for class 1 probabilities)
roc_curve_rf <- roc(hotel_data$booking_status_binary, pred_probs_rf)

# Plot the ROC curve
plot(roc_curve_rf, main = "ROC Curve for Random Forest Model")

# Print AUC value
auc_value_rf <- auc(roc_curve_rf)
cat("AUC:", round(auc_value_rf, 4), "\n")
## AUC: 0.986

Validating the model by cross-validation with Random Forest using 5-fold cross-validation

# Set up parallel processing (use all but 1 core)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# Reduce number of trees for faster cross-validation
ntree_val <- 50

# Perform cross-validation with Random Forest using 5-fold cross-validation
cv_model <- train(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room + 
                  market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month, 
                  data = hotel_data,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 5),  # 5-fold cross-validation
                  tuneGrid = data.frame(mtry = 3),  # You can adjust mtry for tuning
                  ntree = ntree_val)  # Set ntree to a smaller value for faster computation

# Print the results of cross-validation
print(cv_model)
## Random Forest 
## 
## 36275 samples
##     7 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 29020, 29020, 29020, 29020, 29020 
## Resampling results:
## 
##   Accuracy  Kappa
##   0.865     0.679
## 
## Tuning parameter 'mtry' was held constant at a value of 3
# Stop the parallel cluster
stopCluster(cl)

References

  1. Antonio, N., de Almeida, A., & Nunes, L. (2017). Predicting hotel bookings cancellation with a machine learning classification model. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW) (pp. 1100–1107). IEEE. https://ieeexplore.ieee.org/document/8260781

  2. Antonio, N., de Almeida, A., & Nunes, L. (2017). Predicting hotel booking cancellations to decrease uncertainty and increase revenue.Tourism & Management Studies, 13(2), 25–39. https://www.researchgate.net/publication/310504011_Predicting_Hotel_Booking_Cancellation_to_Decrease_Uncertainty_and_Increase_Revenue

  3. Abeyrathne, C., & Bandara, H. M. N. D. (2023). Hotel booking cancellation prediction system using machine learning. Faculty of Engineering, University of Ruhuna. https://www.researchgate.net/publication/380515766_Hotel_Booking_Cancellation_Prediction_System_using_Machine_Learning